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ABSTRACT 

Accurate prediction of transcription factor binding 
sites (TFBSs) is a prerequisite for identifying c/s- 
regulatory modules that underlie transcriptional 
regulatory circuits encoded in the genome. Here, 
we present a computational framework for detect- 
ing TFBSs, when multiple position weight matrices 
(PWMs) for a transcription factor are available. 
Grouping multiple PWMs of a transcription factor 
(TF) based on their sequence similarity improves 
the specificity of TFBS prediction, which was eva- 
luated using multiple genome-wide ChlP-Seq data 
sets from 26 TFs. The Z-scores of the area under a 
receiver operating characteristic curve (AUC) values 
of 368 TFs were calculated and used to statistically 
identify co-occurring regulatory motifs in the TF 
bound ChIP loci. Motifs that are co-occurring 
along with the empirical bindings of E2F, JUN or 
MYC have been evaluated, in the basal or stimulated 
condition. Results prove our method can be useful 
to systematically identify the co-occurring motifs of 
the TF for the given conditions. 



INTRODUCTION 

The ability of every living cell to properly respond to 
diverse stimuli depends on the genetic information encoded 
in its genome and signaling cascades that activate ap- 
propriate transcription factors (TFs) for gene regulation 



(1-3). To understand the global network of transcription 
for controlling diverse cellular responses, it is important to 
identify the regulatory modules that are responsible for 
spatial or temporal gene regulation. For this purpose, 
diverse integrative tools for genomic analysis of DNA se- 
quences, accompanied by information on the transcrip- 
tome and interactome, have been actively developed (4). 

High-throughput technologies, such as ChlP-chip, 
ChlP-PET and ChlP-Seq, allow genome-scale mapping 
of epigenetic modification and protein-DNA interactions 
in particular genomes (5,6). Integration of accumulated 
genome-wide experimental data with DNA sequence in- 
formation allows the construction of a map of the tran- 
scriptional regulatory circuits encoded in a genome that 
can eventually lead to the identification of the regulatory 
modules for gene regulation. However, annotating the 
functional transcription factor binding sites (TFBS) in 
the regulatory modules remains a challenging task (7). 
The problem derives mainly from the nature of the 
DNA sequences that are recognized by transcription 
factors; they are relatively short and degenerate. Further- 
more, transcription factors are known to recognize more 
than one consensus sequence (8), and similar DNA se- 
quences can be recognized by different groups of tran- 
scription factors (9). 

Because accurate prediction of the putative binding sites 
of transcription factors is a valuable tool for understand- 
ing transcriptional regulatory networks and mechanisms 
of transcriptional control, numerous computational 
tools have been generated. The most common method 
is the pattern matching approach that uses a position 
weight matrix (PWM) (10-13) or Hidden Markov 
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Models (HMMs) (14). However, prediction of the putative 
TFBS using the predefined PWM suffers from a high rate 
of false positive discovery (15). To alleviate this problem, 
integration of the heterogeneous information (16), such as 
the DNA sequence conservation score (17,18) DNase-I 
hypersensitive score (19,20), or nucleosome occupancy 
(21) and modification information (22,23), has been suc- 
cessfully applied with enhanced prediction performance. 

In parallel, approaches using PWM clustering based on 
the sequence similarity were proposed. In this method, a 
familial binding profile (FBP) is constructed from the 
multiple PWMs for each family of transcription factors, 
improving the sensitivity of de novo motif discovery algo- 
rithms (24,25). However, a FBP ignores the flanking pos- 
itions of PWMs that are not aligned but which may be 
important for discriminating false positives; hence, this 
method can have low specificity in predicting functional 
binding sites. An alternative approach is to combine 
overlapping TFBSs predicted by the original PWMs be- 
longing to the same cluster (15). This program can 
increase specificity by removing redundant TFBSs, but 
because it is based on the heuristic scoring system, it is 
not suitable for comparing scores of overlapping TFBSs. 
To overcome these problems, we have recently developed 
a motif-based scanning program (26). It searches STAT 
TFBS of high affinity scores, using the combined predicted 
TFBSs from PWMs that show similar binding specificity 
to STAT family members. 

In an attempt to construct an efficient computational 
tool for predicting TFBSs, we applied the motif-based 
scanning program to other transcription factors with 
multiple PWMs. A total of 368 transcription factors and 
565 PWMs were considered in this study. Finally, TFBS- 
scanner was applied to identify the co-occurring c«-motifs 
that might function coordinately. The source code of 
TFBS-Scanner program is freely available from the sup- 
porting webpage, https://sourceforge.net/p/tfbsscanner. 



MATERIALS AND METHODS 

Analysis of the TF-PWM network 

To construct linkages between PWM and its cognate TF, 
we manually extracted 368 vertebrate TFs and 565 verte- 
brate PWMs information from the 'Matrix' and 'Factor' 
database of TRANSFAC professional version 9.4 
(Supplementary Table SI A) (11). To map mouse or rat 
gene to the human gene identifier, Entrez Gene ID (27) 
(from FTP: Gene), 'HomoloGene' information (28) (from 
FTP: HomoloGene, build63) of the NCBI were used. 
Cytoscape 2.7.0 (29) was then used to visualize the TF- 
PWM interactions (Supplementary Table SI A); each node 
indicates TF or PWM, and edges for pairwise interaction. 
A total of 474 PWMs and 368 TFs made up the final 
graph, which contains multiple connected components 
(CCs) (Supplementary Table SIB). See the 'TF-PWM net- 
work. cys' or PDF files for each CC in the supporting 
webpage for detail view. For protein interactions 
between TFs within each CC, a total of 51 837 annotated 
PPI information has been extracted from the 'interactions' 



information (27) (from FTP: Gene) of the NCBI. 
For detail view, see the 'PPI in TF-PWM network. cys' 
or PDF files for each CC in the supporting webpage. 
The number of possible PPI is given by N x (N—l)/2, 
where N is the number of TFs in each CC. To compute 
the distance between two PWMs, we used the distance 
measure proposed by Harbison et al. (30). Given two 
PWMs ©i,02 that are already aligned, the distance 
measure is defined by: 

w=l v/ 1=1 

where W is the length of aligned positions. For the 
optimal alignment between the two PWMs, we used the 
procedure described by Narlikar et al. (21). 

Affinity score of a subsequence 

To improve the generalization performance of the 
observed position count matrices, we transformed each 
position count matrix into a position frequency matrix 
(PFM) by adding position-dependent pseudo-counts. 
We used the statistical method of Rahmann et al. (31) 
for position-dependent regularization. For highly con- 
served positions, we add very small pseudo-counts 
in order to maintain the strong signal. In contrast, we 
add relatively large pseudo-counts at poorly conserved 
positions, preserving the overall composition of nucleo- 
tides of the position count matrix. Given a regularized 
PFM 0] e R**'* 4 , we represent a sequence Sj as a set 
of overlapping subsequences sW = (sy, ■ ■ ■ , s^w-i)) of 
length W to scan TFBSs by sliding a window of 
length W. We assume that a subsequence can be 
generated from either the PFM or a background model 
0 o = [6> o r ,---,0 o r ] r eR H/x4 , where 0 0 is defined by a 
zero-order Markov chain. We used six different back- 
ground models from our previous study (26) to capture 
the compositional bias of GC content in genomic se- 
quences. Then, we converted the regularized PFM into a 
PWM T by computing the log-odds scores between the 
PFM and a background model, which is given by: 



We decide whether a given subsequence is generated from 
the PFM or the background model based on the sum of 
log-odds score: 

w 4 
w=\ 1=1 

Following the statistical method of Rahmann et al. (31), 
we computed the exact distribution of the sum of log-odds 
scores to measure the statistical significance of the 
log-odds scores of the subsequences. The exact distribu- 
tion can be efficiently computed using the positional inde- 
pendence of PWMs and then applying convolution. From 
the exact distribution P@ 0 given the assumption that the 
subsequence is generated from the background model, we 
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computed the type I sequence error probability a n {Sy) of 
the log-odds score X(sW), which is given by: 

a„(sf) « 1 - exp(-«P ( , 0 (x > X(sf))). 

This error quantifies the probability that at least one 
TFBS occurs within a background sequence of length n 
[we set n = 500 as proposed by Rahmann et al. (31)]. 
Finally, the affinity score y(s^) of the subsequence is 
defined as y(Sy') = 1 — a„(s^). Note that the affinity 
score is not dependent on the length of the PWM. 
Therefore, we can directly compare the affinity scores of 
different PWMs. 

Clustering PWMs 

We constructed three different kinds of PWM clusters for 
each TF. PWM cluster type 1 is constructed from all 
PWMs within the CC to which a query TF belongs. 
PWM cluster type 2 consists of all PWMs that are 
directly linked to the TF of interest. PWM cluster type 3 
consists of high-quality PWMs, which were selected based 
on the following procedures. First, each PWM was 
assigned a quality score that quantifies how well a given 
PWM detects true binding sites over noisy sequences (see 
'Quality score of PWMs' below). PWMs with low quality 
scores (<0.7) were discarded as described previously (26). 
Second, the PWM with the highest quality score among 
PWMs of PWM cluster type 2 was selected as the 'repre- 
sentative PWM'. Finally, the inter-motif distance between 
the representative PWM and the PWMs within the CC 
was calculated. PWMs with a distance above the cut-off 
value (>0.1653) were removed; the value of the distance 
cutoff was chosen as described previously (26). For the 
comparison of distance measurements between STAMP 
(32) and our used method (26), we tested the significance 
of overlap between two sets of top 9 ranked PWMs 
excluding the top ranked self PWMs in all 565 PWMs, 
from the default setting of STAMP and our PWM 
distance measure. The statistical significance was 
evaluated by calculating the enrichment P-value based 
on the hypergeometric distribution. 

Quality score of PWMs 

The quality score Q(0i|0o) of a PWM 6i quantifies how 
well the PWM is separated from a background model ©o 
(31). We first define the type II error probability of the 
log-odds score at the given threshold t, given by: 

= P &1 (x < i). 

The quality score is then defined by: 

2(0,100)= 1 -a„(f), 

where the threshold t* is such that a„(t) — p(t). The 
quality scores of all the PWMs used in this study are avail- 
able in Supplementary Table SI A. 

Construction of TFBS-Scanner 

We reconstructed PWMs from TRANSFAC 9.4 (11) to 
evaluate the statistical significance of the predicted TFBS. 
Given a PWM cluster for a query TF, the TFBS-Scanner 



takes a DNA sequence as input to search for putative 
TFBSs of the TF. It then chooses a PWM cluster of the 
TF from the pre-compiled library of PWM clusters and 
searches all TFBSs of the chosen PWMs at the specified 
cutoff value of the affinity scores. Finally, all overlapping 
TFBSs of the PWMs are combined into one with the 
maximum affinity score. 

Z-score of the AUC values 

To account for the compositional bias of GC content 
within the regulatory regions, we defined the Z-score of 
the AUC value as: 

AUC,,- — a, 

Zij = - 

where AUC,y is the AUC value of the rth TF at the y'th 
ChlP-Seq data set. Here, /x; and ay are the mean and 
standard deviation of AUC values of the rth TF among 
the 51 ChlP-Seq data sets. 

ChlP-Seq data sets used in this study 

We compiled 51 ChlP-Seq data sets from the 'Table 
Browser' of the UCSC genome browser (33) or from 
other studies (34-36) (Supplementary Table S2D). To 
evaluate the classification performance of the TFBS- 
Scanner, we constructed 51 test sets consisting of 
positive and negative sets. For a positive set, we used 
the binding regions for each ChlP-Seq data set. We 
used, as suggested by Whitington et al. (23), the binding 
regions defined by the authors if the length of the binding 
region is larger than 500 bp. Otherwise, we expanded the 
binding regions to 500 bp. A negative set was also con- 
structed by randomly sampling nine sequences for each 
binding region, following the method of Ernst et al. (16). 
The nine negative sequences were sampled from the 
non-gapped regions of the same chromosome as the 
binding region, excluding any overlapping sequences in 
the positive set. 

A binary classifier for ChlP-Seq data sets 

Given an input sequence sy and a PWM cluster {0,t}, the 
binary classifier for plotting the ROC curve is given by: 

Asi\{®k}) = mean^ max jy{sj k ), 

where W k is the length of a PWM & k and y{sj k ) 15 tne 
affinity score of subsequences of the input sequence. 

Finding overrepresented motifs by MEME 

To find the overrepresented motifs from ChlP-Seq data, 
we used the program MEME (37), which is one of the 
most popular motif discovery algorithms. We selected 
input sequences (500 bp) from the top 100 binding 
regions for each ChlP-Seq data set. Among a total of 51 
ChlP-Seq data sets, we only considered 36 data sets 
released by ENCODE Yale TFBS because they provide 
P-values for each binding region. We used MEME with 
the following parameters: the distribution of motif occur- 
rences: ZOOPS; the number of different motifs: 10; 
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the minimum motif width: 10; and the maximum motif 
width: 20. From the 10 learned motifs, we chose the one 
that was most similar to the representative PWM of 
TRANSFAC by visual inspection. 

RESULTS 

TFBS-Scanner: predicting TFBSs by clustered PWMs 

To apply a motif-based scanning program to every TF 
with multiple position weight matrices (PWMs), we first 
collected the available information of PWMs and their 
interactions with cognate TFs. A total of 368 TFs and 
565 vertebrate PWMs from the TRANSFAC database 
were considered; each PWM was linked to its associated 
TFs when the interaction was supported (see 'Materials 
and Methods' section). The resulting TF-PWM network is 
a bipartite graph whose nodes (368 TFs and 474 PWMs) 
are linked to each other (Supplementary Table SI A). The 
average number of TFs connected to a single PWM was 
1.53, and a total of 61 (10.8%) PWMs had only one TF 
linkage (PWM:TF = 1:1). The V$EBOX_Q6_01 had the 
largest connection, with 30 different TFs (Figure 1A). 
Multiple PWMs were also found to be connected to 



single TFs; the average number of linked PWMs per TF 
was 2.35 (Figure IB). 

The TF-PWM network consists of 134 multiple con- 
nected components (CCs), defined by the TF-PWM 
subgraphs in which any two nodes within the same CC 
are connected to each other through one or more edges 
(Supplementary Table SIB, 'TF-PWM network. cys' in the 
supporting webpage). The largest CC, CC-3, is con- 
structed with 69 nodes (36 TFs and 33 PWMs) and 98 
edges, whereas the smallest one contains 2 nodes (1 TF 
and 1 PWM) and 1 edge (Figure 1C). Each CC elicits 
distinct connectivity between PWM and TF; the average 
number of PWMs linked to TFs and the average number 
of TFs connected to PWMs within each CC varies 
(Figure ID). 

Using 51 837 annotated protein-protein interactions 
(PPI) from the 'Interaction' information of the NCBI, 
physical interactions between TFs (Supplementary Table 
SIC) within each CC have been investigated (See 
'Materials and Methods' section). We also checked that 
80% of these TF— TF interactions overlapped with the 
human protein-protein interactions (Supplementary 
Table SID), which were physically identified by the 
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The predicted TFBSs are combined and accumulated affinity scores are calculated. 



Mammalian Two-Hybrid Assay (38). Multiple TFs found 
in the same CC are likely to interact with each other when 
the total number of TFs in the CC is relatively small 
(Figure IE). In contrast, as the number of TFs in the 
CC increases, their physical interactions become weaker. 
In the case of CC-1, which consists of 20 TFs, including 
the PPAR and NR family members, only 42 protein- 
protein interactions have been annotated out of 190 
possible interactions (Supplementary Figure SI, 'PPI in 
TF-PWM. cys' in the supporting webpage). The sequence 
similarity among multiple PWMs in the same CC was then 
measured. Interestingly, DNA sequences of the PWM 
within each CC were quite similar, which was not signifi- 
cantly affected by physical interactions between TFs in the 
same CC (Figure IF). These data indicate that the TFs in 
the same CC might be able to recognize similar DNA 
sequences. 

Analysis of the TF-PWM network clearly shows that 
most of the TFs are connected to multiple PWMs. 
Therefore, to examine the effect of PWM clustering on 
the efficiency of TFBS prediction, we compared three dif- 
ferent types of PWM clusters (Figure 2A, Supplementary 
Table S2A-C; see 'Materials and Methods' section). The 
TFBS-Scanner takes a DNA sequence as input to search 
putative TFBSs of any given TFs (Figure 2B). For each 



query TF, TFBS-Scanner searches a pre-compiled library 
of the TF-PWM network and loads multiple PWMs for 
each cluster type. Using the selected set of PWMs, 
TFBS-Scanner screens the query DNA sequence for 
TFBSs with affinity scores higher than a pre-defined 
cut-off value. In the end, overlapping sites among pre- 
dicted TFBSs are combined to calculate the maximum 
affinity score. 

Validation 

To examine the efficacy of TFBS-Scanner, we used 51 
genome-wide ChlP-Seq data sets analyzing 26 vertebrate 
TFs (Supplementary Table S2D). For each ChlP-Seq data 
set, the positive (derived from the peak regions) and the 
negative (background) set of DNA sequences were 
prepared (see 'Materials and Methods' section). The per- 
formance of TFBS-Scanner was evaluated based on the 
standard receiver operating characteristic (ROC) 
analysis, which plots a ROC curve by varying the thresh- 
old of the output of a binary classifier. We designed a 
binary classifier to score each input sequence by averaging 
the maximum affinity scores of subsequences among the 
given PWM cluster (see 'Materials and Methods' section). 
To quantify the performance, the area under the ROC 
curve (AUC) was computed to range from 0 to 1 (for 
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perfect classification). Because a classifier randomly 
guessing the class label of an input has an AUC value of 
0.5, AUC values less than 0.5 have no practical meaning. 

Of the 26 TFs tested, PWM cluster type 2 performed 
better than types 1 and 3 (Figure 3 A and B). In addition, 
prediction efficiency of PWM cluster type 2 was higher 
than that of single representative PWM (Supplementary 
Table S2A-C, Figure 3C). Since PWMs were originally 
built using a limited number of experimentally verified 
TFBSs, the number of the real TFBSs used to create 
each PWM is usually biased. It raises the concern that 
not every PWM might be able to represent real binding 
in vivo. Therefore, we built the single PWMs from the 
ChlP-Seq data set using the de novo motif discovery algo- 
rithm MEME (37), and compared the AUC values of the 
derived PWMs with that of PWM cluster type 2 
(Figure 3D). It is noteworthy to mention that the dissimi- 
larity value of the clustered PWMs of TFs used in our 
analysis was significantly lower than all PWM clusters, 
which might contribute to enhanced TFBS prediction 
(Supplementary Figure S2A). To evaluate the distance 
measure which we used for PWM similarity calculation, 
we compared results from STAMP (32) and our used 
method by calculating the enrichment P-value based on 
the hypergeometric distribution (Supplementary Figure 
S2B, See 'Materials and Methods' section). We observed 
little differences between the two distance measures. 

To evaluate the performance of our PWM clustering 
methods, we then compared the efficiency of PWM 
cluster type 2 to other PWM clusters provided by 
SwissRegulon (39) (Figure 3E), or by the JASPAR 
FAM database (40) (Figure 3F). The PWM cluster type 
2 showed significantly improved performance compared 
to the SwissRegulon (P < 0.0001) or to the JASPAR 
FAM database, which provides 11 FBPs based on the 
structural classification of TFs (P < 0.0063; Figure 3G). 
To test the significance of clustering PWMs, we then 
generated random clusters for each ChlP-Seq data set 
by selecting the same number of PWMs at random 
(simulated by 20000 times), and computed empirical 
^-values of AUC scores. The PWM cluster type 2 
showed significant AUC scores for most of the tested 
ChlP-Seq data sets except for data sets with low AUC 
scores (Figure 3H). These results indicate that the PWM 
clustering is useful to improve the efficiency of TFBS 
prediction. 

Both experimental and computational analysis (16) 
recently reported that TFBSs are preferentially located 
in the genomic regions scored by general binding prefer- 
ence (GBP). It uses integrated heterogeneous data of epi- 
genetic (22,23), DNase-I hypersensitivity (19,20), and 
sequence conservation among species. Peak loci for most 
of the tested ChlP-Seq data sets were largely located in the 
GBP scored regions with higher AUC score (Figure 31). 
By combining the affinity score of TFBS-Scanner with the 
GBP score, prediction performance was significantly 
improved (Figure 3J), indicating that integration of the 
PWM clustering method along with heterogeneous 
motif-independent data set will be more powerful for 
TFBS prediction. 



Searching co-occurring c/s-motifs using TFBS-Scanner 

Because it is of great interest to understand the TF 
complexes that cooperatively act in the promoter or 
enhancer regions (5,41), numerous computational tools 
have been generated to model representative multiple 
motifs or regulatory modules that present in the 
promoter sequences that elicit correlation with their gene 
expression (42-47). Therefore, we asked whether 
TFBS-Scanner could be used to determine co-occurring 
motifs of TFs using ChlP-Seq information, without inte- 
gration of the gene expression data. For this purpose, we 
developed a simple but effective computational tool based 
on clustered PWMs. 

If TF-Y binding to DNA is sequence specific and it 
cooperatively acts with TF-X, then a statistically signifi- 
cant portion of the binding sites for TF-Y will appear in 
the binding regions of the TF-X. Based on this assump- 
tion, we searched for binding sites that frequently 
co-occurred in the ChlP-Seq data set of TF-X. For this 
purpose, the AUC values for 368 TFs in the TF-PWM 
library were first computed using 51 ChlP-Seq data sets. 
However, due to the compositional bias of GC contents in 
the regulatory regions (48), the higher mean percentage of 
GC in each PWM cluster arbitrarily shows higher AUC 
values (Figure 4A). As a result, PWMs with high GC 
content were found to be highly enriched for most of the 
ChlP-Seq data sets (Figure 4B). To solve this problem, 
the AUC values were converted to a Z-score, where the 
sample mean and standard deviation of the AUC values 
are computed for each TF among a total of 51 ChlP-Seq 
data sets. The Z-score was not affected by the GC contents 
of the PWM clusters (Figure 4C and D). 

Modeling co-occurring cis-motifs that are conserved in the 
diverse cell type: a case of E2F family 

Family members of the E2 factor (E2F) family play 
diverse roles in the transcriptional regulation of cellular 
proliferation and differentiation. Because E2F family 
members share a high degree of structural and biochem- 
ical similarity and recognize similar E2F DNA sequences 
(49), it was of interest to identify the co-occurring motifs 
in the E2F occupied genomic loci. For this purpose, four 
ChlP-Seq data sets were analyzed in the diverse cellular 
system using either E2F1 or E2F4 (Supplementary Table 
S2D). In every assay, binding sequences specific for E2F 
were highly enriched at the peak loci, as determined by 
TFBS-Scanner using PWM cluster type 2 (Figure 5A). To 
identify the co-occurring DNA motifs in the E2F binding 
regions, Z-scores of the 368 TFs were then calculated 
(Figure 5B and Supplementary Table S3). As expected, 
TFBSs of the E2F1, E2F2, E2F3, E2F4, E2F5 and 
E2F7 exhibited the highest Z-score and the distribution 
of the normalized cumulative affinity score of these sites 
in the binding loci overlapped (Supplementary Figure S3). 
Along with E2F, PWMs of distinct TFs, such as AHR, 
ARNT, ETF, HIC1, EGR, NRF1, SP1 or TEAD, were 
calculated to have higher Z-scores than the cut-off value. 
Although binding sequences for these TFs were quite dis- 
similar to that of E2Fs (Supplementary Figure S4), distri- 
bution patterns of the normalized cumulative affinity 
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score of these TFs in the ChlP-Seq peak loci were quite 
similar to that of E2F (Figure 5C), indicating that these 
TFs might act in conjunction with E2F for target gene 
regulation. In support of this prediction, functional inter- 
actions between E2F and AHR, ARNT, HIC1, EGR, SP1 
and SP3 for transcriptional regulation have been previ- 
ously reported (50-56). In addition, TFBSs for NRF1, 
TEAD2, and ETF are known to co-occur with that of 
the E2F family in their target genes (57,58). 

Modeling signal-specific co-occurring cis-motifs: a case 
of JUN family 

We next asked whether our method could be used to 
identify signal-specific co-occurring motifs. For this 
purpose, we chose four sets of ChlP-Seq data assayed 
with JUN in K562 cells after IFNa or IFNy stimulation 
(Supplementary Table S2D). JUN is a basic region-leucine 
zipper protein and belongs to the family of AP-1, which 
consists of JUN, FOS, MAF and ATF subfamilies (59). 
Activated AP-1 recognizes and binds to TGA(C/G)TCA 
with high affinity or TGACGTCA with low affinity, and 
acts as a regulatory factor of diverse cellular processes, 
such as proliferation, transformation and apotosis (59). 
Cooperative interactions between AP-1 and other TFs in 
the promoter have been suggested as the regulatory mech- 
anism of target gene specificity governed by AP-1 (60). 

For un-stimulated K562 cells, JUN binding loci did not 
possess prominent JUN binding sequences, as judged by 
conventional SwissRegulon or by TFBS-Scanner using 
clustered PWMs (Figure 6A). However, upon stimulation 



with IFNa, binding sequences for JUN became dominant 
in the peak loci of JUN ChIP. Along with JUN, motifs for 
BACH 1 , FOS, GATA, MAF or NFE2 also exhibited sig- 
nificantly higher Z-scores than the cut-off value, which 
became prominent after stimulation (Figure 6B 
andSupplementary Table S3). The distribution of the 
normalized affinity scores of these TFs overlapped with 
that of JUN, and their co-occurrence became stronger 
after stimulation with IFNa (Figure 6C). In the JUN 
ChlP-Seq data sets for IFNy stimulation, major 
co-occurring motifs were very similar to that of IFNa 
stimulation (Supplementary Table S3). These data 
suggest that JUN might coordinate with BACH1, FOS, 
GATA, MAF or NFE2L2 for gene regulation specific for 
IFNy or IFNa- In support of this prediction, inter-relation 
between the AP-1 motif and c/'.v-element of GATA or NFE 
have been previously reported (61-63). Similar to JUN, 
binding sites for MYC were significantly enriched at the 
peak loci of the MYC ChlP-Seq after IFNa or IFNy 
stimulation, along with co-occurring motifs of 
NHMHB2/3, HAND 1/2, HMX, MAX, MYB, USF1/2 
or XBP1 (Supplementary Figure S5). To evaluate the per- 
formance of clustered PWM over single PWM, we finally 
compared the Z-score of PWM cluster type 2 with that of 
the representative PWM (Supplementary Figure S6). 
In the both E2F1 and JUN ChlP-Seq data set, PWM 
clustering usually showed better performance than the 
representative PWM. These results suggest that our stat- 
istical framework using clustered PWMs is useful at iden- 
tifying co-occurring motifs of TFs. 
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DISCUSSION 

The code for gene regulation is widely accepted to be 
encoded in the four-letter alphabet of the genome, but 
decoding the rules for the transcription regulatory 
circuits that govern diverse cellular responses remains 
challenging. To reconstruct the transcriptional regulatory 
network, an important intermediate step is to efficiently 
predict the binding sites of TFs in the regulatory regions. 
In this article, we developed a computational framework 
for predicting the putative TFBSs for a group of TFs for a 
DNA sequence of choice. Our approach uses modified 
PWMs based on a previous statistical framework (26) 
and clustered PWMs based on their sequence similarity 
and quality. The TFBS-Scanner utilizes 368 TFs and 565 
PWMs data sets, which are available in an up-to-date 
public database. 

The performance of most conventional TFBS predic- 
tion tools heavily depends on the selection of a PWM 
that represents a genuine motif for a given TF. 
However, without a reference set of binding for TFs, it 
is impossible to identify the best-performing PWMs from 
pre-existing pools. Furthermore, accumulating evidence 
suggests that a number of TFs can be associated with 
more than one binding motif, and a single PWM can be 
recognized by functionally distinct TFs (5,8). Therefore, 
conventional motif scanning programs, which assume a 
single PWM for a single TF, have a fundamental 
problem. To maximize prediction efficiency, we strategic- 
ally developed a quality scoring and clustering tool for 
PWMs. Using the genome-wide ChIP binding peaks, 
we were able to identify the set of the reference PWM 
for a given TF and a PWM clustering type that maximizes 
prediction efficiency. Findings of this article support 
the idea that approaches combining multiple PWMs 
associated with a given TF outperform algorithms 
relying on a single PWM. In addition, it is noteworthy 
to mention that integration of the GBP score (16) 
further improved the prediction efficiency of the TFBS- 
Scanner. 

In general, there is growing evidence to support the view 
that a single motif cannot explain all in vivo binding 
regions (5). In addition, in vitro binding of the purified 
TF protein demonstrates that it can recognize more than 
one species of the binding motifs (8). Therefore, using a set 
of qualified PWMs rather than a single PWM to identify 
in vivo binding regions should be more effective and prac- 
tical. However, it is worth noting that our clustered 
PWMs cannot detect binding regions of TFs that are in- 
directly bound or that have bindings dependent on other 
factors that act in neighboring regions. Although a TF has 
the ability to recognize DNA sequences directly, not every 
TF solely depends on direct interaction with DNA se- 
quences in vivo. Instead, protein-aided recruitment by 
the TF adjacent to other TFs may facilitate the formation 
of stable transcription regulatory complexes (41,64,65). In 
recent reports, overlapping localization on the genome 
loci and combinatorial interactions among TFs were 
strongly suggested as one of the features that explains 
complex transcriptional regulation of tissue or condition 
specificity (66,67). 



To identify co-occurring motifs of gene regulation, 
several approaches have been proposed. Identification 
tools of the enriched motifs from ChlP-chip or 
ChlP-Seq data sets can be grouped into two classes 
based on their underlying assumptions. The methods be- 
longing to the first class are based on the assumption that 
a true motif is located at the centers of the binding regions, 
whereas insignificant motifs are uniformly distributed 
(68). However, these approaches require a cutoff to scan 
the TFBSs of a predefined motif. The second class, which 
is based on the assumption that a true motif should be 
overrepresented in the binding regions compared to back- 
ground sequences, overcomes the drawback of the cutoff 
for scanning TFBSs by using standard ROC analysis 
(69,70). Although these approaches can reduce the per- 
centages of false positives by taking into account the GC 
content of background sequences, it could actually filter 
out a true GC-rich motif. In contrast to these approaches, 
our method randomly selects the background sequences 
without considering the GC content, and then eliminates 
false positive GC-rich motifs by normalizing the AUC 
scores. Because our method utilizes Z-scores to select stat- 
istically significant motifs, its efficiency depends on the 
size of the ChlP-Seq data sets to approximate the exact 
distributions of the AUC scores of each TF. 

The genome encodes the blueprints of every possible 
transcriptional regulatory network. A condition-specific 
regulatory network of transcription emerges from 
multiple protein-protein and protein-DNA interactions, 
activated by specific signaling pathways. Our computa- 
tional approach provides the groundwork to build a 
map of these potential networks. Accompanied by 
genome-wide information of all protein-protein and 
protein-DNA interactions, our program will serve as a 
helpful tool to reconstruct the functional network that 
governs specific cellular responses. 
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